0%

一次不接受Elasticsearch官方建议导致的事故

记录一下
一次Elasticsearch集群事故分析、排查、处理

背景介绍

事故发生的ElasticSearch集群共有7台机器:

  • 127.0.204.193
  • 127.0.204.194
  • 127.0.204.195
  • 127.0.220.73
  • 127.0.220.74
  • 127.0.220.220
  • 127.0.220.221

其中193、194、195的机器配置一样,具体如下:

  • CPU:32核
  • 内存:128G
  • 磁盘:4T*3

系统盘单独挂载:40G

73、74、220、221的机器配置一样,具体如下:

  • CPU:32核
  • 内存:128G
  • 磁盘:10T

系统盘单独挂载:50G

以上7台机器用的都是阿里云的高效云盘,https://help.aliyun.com/zh/ecs/user-guide/disks-2
也就是说最大吞吐量(读+写 上限)为140MB/s

问题

由于比较穷,所以集群的部署情况比较复杂,这7台机器上共有一个Kafka集群、一个ZooKeeper集群、两个ElasticSearch集群。
其中Kafka、ZooKeeper集群部署在193、194、195上;
两个ElasticSearch集群各有7个节点(也就是这个7台机器每个都有两个ElasticSearch集群的实例);
193、194、195上的3块盘,Kafka、ZooKeeper、ElasticSearch(2个)集群都在用。

在这次事故之前发生过一次Kafka集群写入延迟,经分析是两个ElasticSearch集群与Kafka集群公用一块磁盘导致磁盘io满,造成Kafka集群集群延迟。

所以将193、194、195上的3块盘的使用调整为:

  • 一块4T的盘单独给Kafka、ZooKeeper使用
  • 另外两块4T的盘两个ElasticSearch集群混用(其中一个ElasticSearch集群的数据量,已经很小了,不到10G,所以可以忽略)

好了,前情回顾已经写完了,下面来说下这次出现的问题:

  • 23号下午5点,ElasticSearch集群挂了
    • 集群没有master
  • 25号上午9点,ElasticSearch集群又开始rebalance shards

分析

因为23号的问题很紧急,所以当时采取的方案就是重启整个集群,先恢复。

193上当时没有master的异常日志

1
2
3
4
5
6
{"type": "server", "timestamp": "2023-10-23T09:22:08,955Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "business-log", "node.name": "es-b-193", "message": "master not discovered or elected yet, an election requires at least 4 nodes with ids from [hYrrmhLHTx-QHoDmZ2wATg, jaNLhd1eT6SUYcLJkHpE1Q, 1VQFmt9jQ-6d7fCMjP-vnQ, uzGdH3VeRbOlcDIWeKdgIw, PhjlCea6TNKh4rdZPyPDkA, ZFQdNP4HSgCtPCnMfDdopw, 1WRaRyU-SuCPn8qbz3e4hg], have discovered [{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{k0x5Rog-S0CFhAmorJ3V0Q}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}, {es-b-221}{jaNLhd1eT6SUYcLJkHpE1Q}{XkD-b14XQzq1-o3EiDSuKA}{127.0.220.221}{127.0.220.221:9301}{cdfhilmrstw}, {es-b-194}{PhjlCea6TNKh4rdZPyPDkA}{6pQJDRevScad3hr_hJcnTg}{127.0.204.194}{127.0.204.194:9301}{cdfhilmrstw}, {es-b-220}{hYrrmhLHTx-QHoDmZ2wATg}{Qyz9MF2bRmKQq9uPWgTBuw}{127.0.220.220}{127.0.220.220:9301}{cdfhilmrstw}, {es-b-195}{1VQFmt9jQ-6d7fCMjP-vnQ}{e_P2NNXnTIWGzg7vE8EWkA}{127.0.204.195}{127.0.204.195:9301}{cdfhilmrstw}, {es-b-74}{uzGdH3VeRbOlcDIWeKdgIw}{fhNTddq7Syi7zPDVCejDWQ}{127.0.220.74}{127.0.220.74:9301}{cdfhilmrstw}, {es-b-73}{1WRaRyU-SuCPn8qbz3e4hg}{1iRrVMWcSNGGxsuQYxePSg}{127.0.220.73}{127.0.220.73:9301}{cdfhilmrstw}] which is a quorum; discovery will continue using [127.0.204.194:9301, 127.0.204.195:9301] from hosts providers and [{es-b-220}{hYrrmhLHTx-QHoDmZ2wATg}{Qyz9MF2bRmKQq9uPWgTBuw}{127.0.220.220}{127.0.220.220:9301}{cdfhilmrstw}, {es-b-73}{1WRaRyU-SuCPn8qbz3e4hg}{1iRrVMWcSNGGxsuQYxePSg}{127.0.220.73}{127.0.220.73:9301}{cdfhilmrstw}, {es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{k0x5Rog-S0CFhAmorJ3V0Q}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}, {es-b-74}{uzGdH3VeRbOlcDIWeKdgIw}{fhNTddq7Syi7zPDVCejDWQ}{127.0.220.74}{127.0.220.74:9301}{cdfhilmrstw}, {es-b-194}{PhjlCea6TNKh4rdZPyPDkA}{6pQJDRevScad3hr_hJcnTg}{127.0.204.194}{127.0.204.194:9301}{cdfhilmrstw}, {es-b-221}{jaNLhd1eT6SUYcLJkHpE1Q}{XkD-b14XQzq1-o3EiDSuKA}{127.0.220.221}{127.0.220.221:9301}{cdfhilmrstw}, {es-b-195}{1VQFmt9jQ-6d7fCMjP-vnQ}{e_P2NNXnTIWGzg7vE8EWkA}{127.0.204.195}{127.0.204.195:9301}{cdfhilmrstw}] from last-known cluster state; node term 43, last-accepted version 397615 in term 43", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "ZFQdNP4HSgCtPCnMfDdopw"  }
{"type": "server", "timestamp": "2023-10-23T09:22:11,119Z", "level": "WARN", "component": "r.suppressed", "cluster.name": "business-log", "node.name": "es-b-193", "message": "path: /_cat/nodes, params: {h=ip,name,heap.percent,heap.current,heap.max,ram.percent,ram.current,ram.max,node.role,master,cpu,load_1m,load_5m,load_15m,disk.used_percent,disk.used,disk.total}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "ZFQdNP4HSgCtPCnMfDdopw" ,


org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];\",\n","stream":"stdout","time":"2023-10-23T09:21:34.490903538Z"}
{"log":"\"at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:179

恢复后,排查日志,主要看到的现象就是193节点频繁的added、removed

1
2
3
added {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{k0x5Rog-S0CFhAmorJ3V0Q}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}}, term: 43, version: 397620, reason: ApplyCommitRequest{term=43, version=397620, sourceNode={es-b-73}{1WRaRyU-SuCPn8qbz3e4hg}{1iRrVMWcSNGGxsuQYxePSg}{127.0.220.73}{127.0.220.73:9301}{cdfhilmrstw}{ml.machine_memory=133070966784, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=33285996544, transform.node=true}}

removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{k0x5Rog-S0CFhAmorJ3V0Q}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}}, term: 43, version: 397621, reason: ApplyCommitRequest{term=43, version=397621, sourceNode={es-b-73}{1WRaRyU-SuCPn8qbz3e4hg}{1iRrVMWcSNGGxsuQYxePSg}{127.0.220.73}{127.0.220.73:9301}{cdfhilmrstw}{ml.machine_memory=133070966784, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=33285996544, transform.node=true}}

导致193上的shard一直rebalance。但193上日志,不知道什么原因丢失了,能找到的最早的日志也是23号下午5点21分之后的日志了。经过日志排查问题到这里被卡主了。

回看发生问题前的一段时间各个节点的监控,CPU、内存、磁盘、写入量等都没有异常的波动,唯一发现的一个异常:

那么generic thread pool是做啥的呢?

=> 用于通用操作(例如,后台节点发现)它的线程池类型是动态缩放的

到这里也只是能印证了,节点间相互发现有问题。那么到底什么原因造成的集群宕机呢?

时间来到了今天上午也就是25号9点左右的时候,发现集群又开始rebalance shards,通过

1
GET /_cat/shards?h=index,shard,prirep,state,unassigned.reason

可以看到shard重新迁移的原因是node-left,到master节点的机器看下日志:

1
2
3
4
5
root@jiankunking-es-02:~# docker logs -f 1f741600dbcc |grep node-left
{"type": "server", "timestamp": "2023-10-23T09:33:18,042Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-221}{jaNLhd1eT6SUYcLJkHpE1Q}{XkD-b14XQzq1-o3EiDSuKA}{127.0.220.221}{127.0.220.221:9301}{cdfhilmrstw} reason: disconnected], term: 52, version: 399161, delta: removed {{es-b-221}{jaNLhd1eT6SUYcLJkHpE1Q}{XkD-b14XQzq1-o3EiDSuKA}{127.0.220.221}{127.0.220.221:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T01:52:35,140Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw} reason: followers check retry count exceeded], term: 52, version: 404805, delta: removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T01:53:24,564Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw} reason: disconnected], term: 52, version: 404807, delta: removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T01:53:30,879Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw} reason: disconnected], term: 52, version: 404810, delta: removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }

按理说 193 194 195这些机器与另外4台机器相比,区别点在于挂了多块盘,那性能应该更好啊,为啥连不上的会是他们呢?

又回看了下节点日志,发现节点间互连存在大量的超时

master(194)节点日志:

1
2
3
4
5
{"type": "server", "timestamp": "2023-10-25T01:35:18,510Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-194", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s][r]}{70758574}{false}{false}{false}}] of size [21075] on [Netty4TcpChannel{localAddress=/127.0.204.194:57580, remoteAddress=127.0.204.193/127.0.204.193:9301, profile=default}] took [29403ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA"  }
{"type": "server", "timestamp": "2023-10-25T01:35:18,510Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-194", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{70758608}{false}{false}{false}}] of size [1992] on [Netty4TcpChannel{localAddress=/127.0.204.194:57580, remoteAddress=127.0.204.193/127.0.204.193:9301, profile=default}] took [29403ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T01:35:18,510Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-194", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{70758614}{false}{false}{false}}] of size [5563] on [Netty4TcpChannel{localAddress=/127.0.204.194:57580, remoteAddress=127.0.204.193/127.0.204.193:9301, profile=default}] took [29403ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T01:35:18,510Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-194", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s][r]}{70758686}{false}{false}{false}}] of size [9103] on [Netty4TcpChannel{localAddress=/127.0.204.194:57580, remoteAddress=127.0.204.193/127.0.204.193:9301, profile=default}] took [28603ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T01:35:18,510Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-194", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s][r]}{70758737}{false}{false}{false}}] of size [12666] on [Netty4TcpChannel{localAddress=/127.0.204.194:57580, remoteAddress=127.0.204.193/127.0.204.193:9301, profile=default}] took [28603ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }

普通节点(193)

1
2
3
4
{"type": "server", "timestamp": "2023-10-25T01:34:46,866Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-193", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{73855747}{false}{false}{false}}] of size [3828] on [Netty4TcpChannel{localAddress=/127.0..204.193:38616, remoteAddress=127.0.220.221/127.0.220.221:9301, profile=default}] took [35027ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "ZFQdNP4HSgCtPCnMfDdopw"  }
{"type": "server", "timestamp": "2023-10-25T01:34:46,866Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-193", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{73855763}{false}{false}{false}}] of size [14563] on [Netty4TcpChannel{localAddress=/127.0.204.193:38616, remoteAddress=127.0.220.221/127.0.220.221:9301, profile=default}] took [35027ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "ZFQdNP4HSgCtPCnMfDdopw" }
{"type": "server", "timestamp": "2023-10-25T01:34:46,866Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-193", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{73855795}{false}{false}{false}}] of size [12940] on [Netty4TcpChannel{localAddress=/127.0.204.193:38616, remoteAddress=127.0.220.221/127.0.220.221:9301, profile=default}] took [35027ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "ZFQdNP4HSgCtPCnMfDdopw" }
{"type": "server", "timestamp": "2023-10-25T01:34:46,866Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-193", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s][r]}{73855813}{false}{false}{false}}] of size [10721] on [Netty4TcpChannel{localAddress=/127.0.204.193:38616, remoteAddress=127.0.220.221/127.0.220.221:9301, profile=default}] took [35027ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "ZFQdNP4HSgCtPCnMfDdopw" }

看到到这里难道是网络或者网卡打满了?

看下带宽Bandwidth:

1
2
3
root@jiankunking-es-02:~# ethtool eth0 | grep Speed
Speed: Unknown!
root@jiankunking-es-02:~#

再看下网卡实时流量

1
iftop -n

找负责机器的同事核对,得到以下信息

  • 瓶颈是按照网卡上限看,不看两个服务器之间链路 =>也就是服务器间的网络带宽应该是远大于机器网卡的,所以服务器的网卡更容易成为瓶颈
  • 服务器网卡上限是:内网带宽(bit/s) 10.00Gpbs=>也就是服务器网卡总流量的上限是 10/8 大约是1G的流量

同时要了下阿里云的监控图

从阿里云的监控可以判断出这时候网卡并没有满的(进+出,也就能到阈值10%的样子),我们再看下23号傍晚的阿里云监控

这里的要注意下,对于阿里云来说,比如网卡有阈值,但短时间阿里云是允许突破限制的(虽然这里并没有满)

既然网络没有问题,那问题出在哪?

算了,找一个节点频繁added remove的时间在master(193)上抓包一下

从抓包中可以看出以下信息:

  • 网络连接正常,建立连接–发送数据–连接关闭
  • 是master节点主动断开的连接

但是master节点为什么断开的连接,就不清楚了。追查到这里,感觉问题应该不是在网络还是es自身哪里出了问题。

继续一顿Google,找到了这么一个issue:https://github.com/elastic/elasticsearch/issues/67873

虽然各种异常提示都不太一样,但现象是一样的

在issue中发现shard数太多会对于集群造成一些不利的影响。

我们回看了下集群shard监控

发现shard数一直在增加,整个集群已经到达6w+。要不我们也删除部分历史索引试试?

这时候74节点一直在added removed,我手动重启了74节点,并将尝试将7、8月份的索引删除,集群逐步恢复到70%+的时候,又有节点下线了;

我们再次尝试删除9月份加10月份1号-6号的索引,这时候集群恢复成功了(这时候集群shard数已经从6w降到了4w)。

问题暂时得到了处理,观察一段时间再没有发现node-left的情况,但根本原因并没有找到。(比如23号重启集群后,shard数也是接近6w,但也重启成功了,但25号却一直没有成功)。

如果大家有这方面的资料或者最佳实践,欢迎留言。

小结

经过这件事之后,我最大的一个感触就是官方的推荐值还是要参考一下:
Elasticsearch 集群内应该设置多少个分片(shard)?

官网的建议(针对7.X)

1
2
3
4
5
6
Aim for 20 shards or fewer per GB of heap memory

The number of shards a data node can hold is proportional to the node’s heap memory.
For example, a node with 30GB of heap memory should have at most 600 shards.
The further below this limit you can keep your nodes, the better.
If you find your nodes exceeding more than 20 shards per GB, consider adding another node.

主要是因为要控制成本所以只要集群不出问题,官方的建议,我们一般是这么个态度:

其实对于shard大小跟每个节点推荐的数据量一样,我们都没有接受他们的建议。比如这次排查为什么没有怀疑过30T数据量太多呢?因为之前都是40T+。

其实经过这件事,还有以下几个点想分享一下

  • 各种集群间应该资源隔离,也就是机器不要共用
  • 不要在各种边缘上运行
    • 比如我们es的磁盘其实一直就是瓶颈,只要触发rebalance shards,磁盘io一定是满的,就会导致写入速率下降等各种问题
  • 集群内各个相同功能的节点配置要一样,比如我们的es集群,磁盘大小不一样,当到达磁盘阈值的时候,集群负载就会不均衡

按照8.3+估算内存使用量

索引数:4300
分片数:6w
mapping字段总数:23w
60000*(230000/4300)*1kb=3209302 KB =3GB 堆空间

欢迎关注我的其它发布渠道